Designing Test-beds for General Anaphora Resolution

Oana Postolache†* and Dan Cristea*♦

†University of Saarland, Saarbrücken, Germany
oana@coli.uni-sb.de
*"Al. I. Cuza" University, Iaşi, Romania
♦Romanian Academy - the Iaşi Branch, Institute of Computer Science, Romania
dcristea@infoiasi.ro

Abstract

This paper proposes a framework for evaluating coreference resolution systems, taking into account the contribution of subcomponents. Our goal is to have a way to quickly identify bottlenecks so that the development effort can focus on the weakest part of the processing chain. We describe experiments on the contribution of the module for searching potential referential expressions using two types of input, plain text and Penn Treebank-style syntactically annotated data. We propose a metric for evaluating a coreference resolution system when the set of potential referential expressions in the input is not the same as the set of potential referential expressions in the gold.

1 Introduction

In more and more NLP applications, such as summarization (Alonso and Fuentes, 2003), information extraction (Gaizauskas and Humphreys, 2000; Luo et al., 2004) and question answering (Harabagiu and Maiorano, 2003), anaphora resolution is reported to be a key component. In all these systems, which generally employ a pipe-line architecture, the evaluation of system performance should consider the contribution of each component to the improvement or decrease in the performance of the overall system. Although seen as a component of other larger systems, coreference resolution is not an isolated process, since it often relies on other more basic processes. To give an example, in its most elementary setting, an anaphora resolution (AR) system uses potential anaphors among the referential expressions (also called markables) that it tries to link into chains of coreferences. But the detection of these markables is a process in itself, which is also prone to errors that can impact the performance of the AR system.

There have been few attempts to study the performance of AR systems in correlation with other more basic processes, such as the automatic detection of markables. The majority of approaches consider that the gold coreference chains and the test coreference chains share the same set of markables. For instance, in a recent approach dedicated to the evaluation of the coreference task, Popescu-Belis et al. (2004) show that all the measures they implemented rely on the hypothesis that the set of markables in both gold and test files is the same.

In this paper we report on new experimental data with a model of coreference resolution that works on two different types of input (plain text and Penn Treebank-style syntactically annotated data). To evaluate the components of an anaphora resolution system we propose a methodology which takes into account the different subtasks and aims to quantify the contribution of the component parts to the overall system performance in a pipe-line architecture. Moreover, the same procedure can be applied to the evaluation of larger NLP systems with an AR component.

The structure of the paper is as follows. Section 2 presents the evaluation strategy we propose. Section 3 describes the two types of corpora we have used and how the module for detecting the markables works on them. Section 4 is a brief overview of the coreference resolution system we use. In Section 5 we present the methods of evaluation employed by our test-bed, and in Section 6 we discuss the results and conclude.

2 Evaluation strategy

We believe that the evaluation of a complex NLP system should be done in a way that shows the performance contribution of different components. This way bottlenecks can be identified and the development effort can be focussed on the weakest part of the chain. We have used such a methodology for the first time during the development of an incremental discourse parsing and summarization system. In this paper we express the principal arguments and benefits of such a methodology and show how it is implemented for the evaluation of a coreference resolution system.

We see a system intended to extract the coreference chains as being built out of two modules in a pipe-line: one that detects referential expressions (RE), which are potential anaphors (we will refer to this module as the RE-extractor), and one that creates the coreference chains (AR-engine).

The input to the RE-extractor can be any kind of text, annotated or not. As we will see in Section 3, the RE-extractor is responsible for detecting the markables, using information from either some pre-processing phase or from manual annotation (if provided). The output of the RE-extractor must be, however, for all types of input, a file conformant with the format accepted by the AR-engine.

The AR-engine is described briefly in Section 4, and in greater detail in (Cristea et al., 2002; Postolache, 2004; Postolache and Forascu, 2004; Cristea and Postolache, in press). The output of the AR-engine is a partition of the input markables into coreference chains.

The quality of the final output depends on the quality of both modules and will manifest successive degradation. To blame, in this context, one module more than the other means to compare their behaviour in a setting in which their results are not influenced by their interaction in a chain.

The proposed test-bed is displayed in Figure 1, where (P, R, F) represent precision, recall and F-measure of the RE-extractor, obtained by comparing the results of the RE-extractor (re-test) against the corresponding gold standard (re-gold). Then the results of the coreference module (AR-engine) can be measured on the true markables or on the system output markables computed by the RE-extractor: when it receives the re-test, it outputs the re-test-coref-test; when it receives the re-gold, it outputs the re-gold-coref-test. In both cases, the evaluation is done against the re-gold-coref-gold (in which both markables and coreferences are gold). We evaluate the coreference module using the success rate (Mitkov, 2002). On the one hand, comparing the completely automatic output of the system against the gold standard we obtain success rate SR1, which gives the overall performance of the system. On the other hand, comparing the output of the AR-engine, fed with a perfect input, against the gold standard, we obtain success rate SR2, which gives only the performance of the second module, unaffected by the degradation of performance due to the processing of the first module. In this way an improvement in the performance of the whole system can be correlated with that of its individual modules, and bottlenecks can be identified.

However, this scheme is not always easy to implement. The difficulties in doing so are discussed in Section 5. In the next section we describe how the RE-extractor operates on two types of input: plain text and manually annotated text.

3 Extracting referential expressions from two types of input

For our experiments, we use data from two types of corpora as input for the RE-extractor:
• A plain text corpus of approx. 19,500 words in 1,966 sentences, extracted from Orwell's novel "1984" (Orwell, 1949);
• A corpus manually annotated for syntactic structure, containing approx. 6,250 words in 281 sentences, extracted from the English Penn Treebank (Marcus et al., 1994).

Our markables are generally conformant with the MUC-7 (Hirschman and Chinchor, 1997) and ACE (ACE, 2003) criteria, although there are some differences. In the following we present the types of mentions that we marked as referential expressions:
• noun phrases: definite (the principle, the flying object), indefinite (a book, a future star) or undetermined (sole guardian of truth), singular and plural, and including names (Winston Smith, The Ministry of Love), dates (April), currency expressions ($40) and percentages (48%);
• pronouns: personal (I, you, he, him, she, her, it, they, them), possessive (his, her, hers, its, their, theirs), reflexive (himself, herself, itself, themselves) and demonstrative (this, that, these, those);
• wh-pronouns: relative pronouns which, who, whom, whose and that when they replace an entity; when they are part of an interrogative sentence they are not marked as referential expressions;
• numerals: when they refer to entities (four of them, the first, the second), but not when they have an adjectival or adverbial function (He was 29, three books).

The most important differences from the MUC-7 conventions are that our markables do not include relative clauses, each term of an apposition is taken separately ([Big Brother], [the primal traitor]), conjoined expressions are annotated individually ([John] and [Mary], [hills] and [mountains]), and modifying nouns appearing in noun-noun modification are not marked separately ([glass doors], [prison food], [the junk bond market]).

In the following subsections we describe the extraction of the markables for each of the two types of input mentioned.

3.1 The Orwell corpus

The corpus consists of chapters 1, 2, 3 and 5 of Orwell's book. The text was first POS-tagged, then tagged for dependency structure (Järvinen and Tapanainen, 1997). For detecting the markables, we automatically extracted all structures dominated by a noun or a pronoun, using the FDG (Functional Dependency Grammar) structure. We imposed the constraint that relative clauses are not part of a markable. The heads of the markables were also automatically extracted (the root of each such structure). The output of this module was considered a re-test file.

In order to obtain the re-gold file, a group of annotators manually went through the output of the NP-detector, looking for errors and eliminating them. They also marked markables other than NPs or pronouns that had not been identified by the automatic process. The FDG annotation was also used to extract additional information needed for the anaphora resolution process, such as the lemma of words, the lexical number, and the dependency links between words.

Table 1 shows some statistics on the Orwell corpus.

                         Text 1   Text 2   Text 3   Text 4
No. of sentences            311      175      169      328
No. of words               6935     3317     3260     6008
No. of REs                 1942      914      916     1702
Average no. of REs
  per sentence              6.2      5.2      5.4      5.1
Pronouns                    645      281      362      614
No. of DEs                  921      520      464      863

Table 1: Statistics on the Orwell corpus. REs stands for referential expressions and DEs for discourse entities.

3.2 The Penn Treebank corpus

The corpus consists of 7 files (wsj_2428, wsj_2430, wsj_2431, wsj_2435, wsj_2438, wsj_2443, wsj_2444). To obtain the markables we first automatically extracted (using the Penn Treebank annotation structure) the nouns, the pronouns, and the noun phrases. We obtained a set of markables on which we imposed some constraints in order to eliminate some of the NPs that didn't correspond to what we considered a referential expression. The most frequent situation in which we had to eliminate a markable was the case of embedded NPs. For example, for the sentence:

The market for $200 billion of high-risk junk bonds, battered by a succession of defaults and huge price declines this year, practically vanished Friday

we extracted automatically the following markables:

0: The market
1: $ 200 billion
2: high-risk junk bonds
3: $ 200 billion of high-risk junk bonds
4: The market for $ 200 billion of high-risk junk bonds
5: a succession
6: defaults
7: a succession of defaults
8: huge price declines
9: this year
10: huge price declines this year
11: a succession of defaults and huge price declines this year
12: Friday

The Penn Treebank annotation is done in such a way that, for example, given the construction A of B, then A is an NP, B is an NP and also A of B is an NP. The same thing happens also with other types of embedded NPs. Conforming to the view that REs should reflect discourse entities, our belief is that A alone should not be considered an RE and we rule it out, such that only two of these markables remain, namely B and A of B.

Thus, for the above example, we consider as markables the following phrases:

2: high-risk junk bonds
3: $ 200 billion of high-risk junk bonds
4: The market for $ 200 billion of high-risk junk bonds
6: defaults
7: a succession of defaults
8: huge price declines
9: this year
12: Friday

We also eliminated the relative clauses from markable spans and we considered the terms of an apposition construction as being two REs (in the Penn Treebank the whole apposition construction is also considered an NP). After running these scripts that eliminated some erroneous markables we obtained the corresponding re-test files. As in the case of the Orwell corpus, annotators corrected the errors and introduced new markables in order to obtain the re-gold files.

After detecting the markables we automatically enriched the annotation with basic information needed for coreference:
• choosing the head of the markables: we have used an approach similar to the one described in (Collins, 1999). We consider all the words on the first level of the NP (eliminating the nested NPs) and then search from right to left for the first word whose part-of-speech tag is NN, NNS, NP, NPS, CD, POS, or JJS [1].
• finding the dependency links between words: we consider that every word is linked to the head of the phrase to which it belongs. We use Collins' rules for detecting the head of the phrases, and then, for each word that is part of a phrase (but not in the head position) the link indicates the head. For a word that is the head of a phrase, the link indicates the head of the parent phrase. Some words - the predicative verbs - have no link.
• finding the lemma (stem) of words: we use the WordNet 2.0 API in order to extract the stem of words.

[1] NN = noun, singular or mass; NNS = noun, plural; NP = proper noun, singular; NPS = proper noun, plural; CD = cardinal number; POS = possessive ending; JJS = adjective, superlative.

Table 2 shows some statistics on the Penn Treebank corpus we used.

                     File 1  File 2  File 3  File 4  File 5  File 6  File 7
No. of sentences         83       9      47      10      19      63      50
No. of words           1837     192    1111     150     534    1483     937
No. of REs              533      65     344      52     165     423     262
Average no. of REs
  per sentence          6.4     7.2     7.3     5.2     8.6     6.7     5.2
Pronouns                 44       5      29       4      11      35       8
No. of DEs              483      49     242      30     110     392     240

Table 2: Statistics on the Penn Treebank corpus.

4 The AR-engine and the coreference model

Our coreference model is integrated in the anaphora resolution framework of (Cristea and Dima, 2001). Anaphoric references in this approach are processed on a three-layered representation. The text layer retains the markables (REs) from the surface text. Below it, on the projection layer, each markable is represented as a projected structure (PS) with all features that are relevant for the resolution process. Deeper, on the semantic layer, the discourse entities (DEs) are represented. One processing cycle of the engine deals with the resolution of one RE and goes through three compulsory phases and an optional one: during the projection phase the current PS is built on the projection layer using the information centred on the current RE obtained from the text layer; the proposing/evoking phase is responsible for matching the current PS with one DE, by either proposing a new discourse entity or deciding on the best candidate from the existing ones; in the completion phase, the data contained in the resolved PS is combined with the data configuring the found referent; sometimes, an optional re-evaluation phase is triggered to resolve postponed PSs left on the projection layer (Cristea and Postolache, in press).

Any model under this processing paradigm is seen as a quadruple: a set of primary attributes characterising the descriptions on the three layers; a set of knowledge sources capable of filling values for the primary attributes of the PSs on the projection layer with information gathered from the text layer; a set of heuristics or rules intended to answer one or both of the following two questions, in this order: (1) Does a PS introduce a new discourse entity? (2) If not, which of the existing DEs does it co-refer with?; and a domain of referential accessibility, which can be implemented also as a set of heuristics/rules able to filter and order the list of candidate DEs. In accordance with most authors, three types of rules are involved in the proposing/evoking phase: demolishing rules, which describe hard constraints prohibiting a coreference link if certain conditions apply; certifying rules, which license a coreference link regardless of other criteria; and promoting/demoting rules, which increase/decrease a score contributed by a certain attribute and associated with a pair (PS, DE).

In our model, component one consists of the following set of primary attributes, associated with REs and further projected onto PSs and DEs: lemma (the lemma), number (lexical number), pos (part-of-speech), role (the syntactic role of the RE in the sentence), all of these characterizing the head word of the corresponding RE, link (syntactic dependency), npText (the span of text covered by the RE), includedNPs (the set of nested REs in the current RE), isDefinite (yes, if the RE is a definite markable), isUndefinite (yes, if the RE is an indefinite markable), isMaleName, isFemaleName, isFamilyName (with the evident meaning, decided based on big files with male, female and family names), HeSheItThey (the probability of the corresponding RE to be referred to by a he, she, it or they, computed from WordNet), offset (from the beginning of the text), sentenceID (the ID of the sentence the RE belongs to), isPerson (if it is a person or not according to WordNet), predNameOf (the ID of the subject in case the RE is a predicative noun), headForm (the form of the head, not the lemma, in order to distinguish between they and themselves).

The second component implements three types of rules. The demolishing rules section, intended to rule out a possible DE as referent candidate of a PS, includes just one rule: IncludingRule, which prohibits coreference between nested REs. The certifying rules, which if evaluated to 'true' on a pair (PS, DE) certify without ambiguity the DE as a referent of the PS, include PredNameRule and ProperNameRule. PredNameRule says that if the current RE is on the position of a predicative noun then the DE corresponding to the RE on the position of the subject is certified as an antecedent. In establishing this coreference link between predicative noun and subject we deliberately disregarded in this phase of our model relative identities such as state transitions, eventualities, etc. ProperNameRule certifies as antecedent a DE displaying the same proper name as the current RE.

Promoting/demoting rules are applied after the certifying and demolishing rules and increase/decrease a resolution score associated with a pair (PS, DE). In our model we implemented 8 such rules. HeSheItTheyRule gives the probability that the candidate DE can be referred to by the current RE, a he, she, it or they pronoun. RoleRule distinguishes cases of very close scores among antecedents. It returns values between 0 and 1 according to the syntactic role of the antecedents, where the role ranking is consistent with the ranking used in Centering Theory (Grosz et al., 1995), namely subject > direct object > indirect object > attributive. NumberRule prefers antecedents agreeing in the number attribute with the current RE. LemmaRule is the place where constraints about the word components of the NPs are implemented. PersonRule implements the preference of pronoun REs like themselves to point to person-type antecedents.

Besides these, three other rules implement semantic and lexical distance preferences based on WordNet. SynonymyRule uses the lemmas of its arguments (a PS and a DE) and interrogates WordNet in order to find a synset that contains both lemmas. If such a synset is found, the rule returns 1, else 0. HypernymyRule is similar to SynonymyRule, only that it looks into hypernyms instead of synonyms. WordnetChainRule looks in WordNet for a hypernymic or hyponymic chain between the lemma attribute of its first argument, the PS, and each of the lemmas of its second argument, the DE. If such a chain is found, the rule returns 1, else 0. Since it is possible for the two lemmas, corresponding to the anaphor and candidate antecedent, to have different meanings in the text (in which case the PS corresponding to the anaphor RE does not refer to the candidate DE), we attached a relatively low weight to all the WordNet accessing rules. This way an accidental synonymy, hypernymy or chain matching will not drastically influence the final score, but it can be exploited when conjoined with other rules. Also, since there are cases when a DE has a set of different values for an attribute, because it incorporates the values triggered from all the REs that refer to it, a scoring rule tests the inclusion of the value of a PS attribute within the set of values of the corresponding DE attribute. For more details about the coreference model, its attributes and rules see also (Postolache and Forascu, 2004).

5 Evaluation

Figure 1 displays the processing chain for a coreference resolution task, the input, the output and the intermediate results, as well as the gold files against which the results of the processing steps are compared. The evaluation results are shown between the two main processing steps, performed by the RE-extractor and the AR-engine, as well as at the end of the processing chain. As can be seen in this figure, to evaluate the overall process of coreference resolution and the intermediary steps, three types of evaluations have been made:
1. the module for the extraction of the markables (RE-extractor) is evaluated;
2. the coreference module is evaluated when both the test file and the gold file have the same sets of markables (the gold set);
3. in order to measure the influence of the RE-extractor over the overall process, the coreference model is evaluated by comparing a test file that has as markables the output of the RE-extraction module against the gold coreferences file built on the gold set of markables.

In what follows we discuss methods of evaluation for these three steps.

5.1 Evaluating the referential expressions extractor

For both types of input (plain text data and syntactically annotated data) the RE-extractor produces the same format accepted by the AR-engine. One of the constraints that the AR-engine imposes on its input is that markables should have their head set and that no two different markables may have the same head.

We evaluated the RE-extractor by comparing the re-test files against the re-gold files. In Figure 1, the results are displayed in terms of precision (P), recall (R) and F-measure (F) for both types of input. We have used two techniques for evaluation: head matching, where two markables were considered to match if they have the same head, and we computed precision, recall and F-measure; and partial matching, in which we considered that two markables match if they have the same head and the minimal mutual overlap was higher than 50%. The overlap score was computed by dividing the number of common words by the size of the longer of the two markables. The results are shown in Figure 1, where the dark squared diagrams have titles that indicate the original corpora and the type of evaluation. They show, as expected, that when we imposed a lower bound of 50% on the mutual overlap of markables the performance of the extractor is poorer.

[Figure 1 (diagram): the input is processed by the RE-extractor, which produces re-test; the AR-engine turns re-test into re-test-coref-test and re-gold into re-gold-coref-test; both outputs are evaluated against re-gold-coref-gold. Values reported in the figure:]

                            Orwell     PTB
Head matching (HM)     P     0.85      0.90
                       R     0.94      0.95
                       F     0.89      0.92
Partial matching (PM)  P     0.74      0.78
                       R     0.80      0.82
                       F     0.76      0.79
SR1                          0.55      0.61
SR2                          0.66      0.69

Figure 1: An example of processing and its evaluation points: Orwell stands for the "1984" corpus, PTB for the Penn Treebank corpus, HM for head matching, PM for partial matching, SR1 is the Success Rate for coreference resolution when the sets of markables are different and SR2 is the Success Rate when the sets of markables are the same.

5.2 Evaluating the AR-engine

The metric used to evaluate the AR-engine was based on the success rate (the ratio between the number of correctly resolved anaphors and the number of all anaphors) defined in (Mitkov, 2002). We have considered each of the referential expressions that we have marked as a potential anaphor and, instead of deciding whether a certain anaphor was correctly resolved or not, for each anaphor we assign a correctness value between 0 (meaning incorrectly solved) and 1 (meaning correctly solved). We explain below how this value is computed when the sets of markables in the test file and the gold file are identical.

Since our output consists of coreference chains (sets of anaphors which refer to the same entity), we must compare the gold set of chains (obtained from the manually annotated corpus) with the system output chains (we will call it the test set). For each anaphor:
• if, in the gold set, it belongs to a chain that doesn't contain any other anaphor, then we look in the test set to see if it belongs to a similar trivial chain, in which case it will get the value 1; else (the test chain corresponding to the current anaphor contains more than one anaphor) it will get the value 0;
• if, in the gold set, the anaphor belongs to a chain containing n other anaphors, then we look in the test set and count how many of these n anaphors belong to the chain corresponding to the current anaphor (we denote this number by m). The ratio m/n will be the value assigned to the current anaphor.

We then sum up these values for the whole set of anaphors and divide by the cardinality of this set.

For the third type of evaluation needed, when the sets of markables in the test file and the gold file are different, there are three possibilities, as follows:
• there is a common set of markables (on the identity of head criterion), but which could possibly cover different spans of text;
• in the test there are markables which have no corresponding markables in the gold standard (false alarms) and vice-versa (misses);
• the test file and the gold file have different chains of markables which refer to the entities in the text, different with respect not only to the partition of the common markables, but also with respect to the markables of the chains (for example, the test chains may contain some markables that are not even considered as markables in the gold).

We described above the way of evaluating the coreference chains when the two sets of markables are the same. To extend this method to the case when the two sets of markables are the same (on the identity of head criterion) but the markables' spans are different, we adjust the correctness score as follows. In counting the scores corresponding to anaphors in trivial chains (a singleton coreference chain consisting of a single referential expression) we replace the 1 values in the sum (meaning correctly solved) with the ratio between the number of common tokens and the length of the longer markable, giving a mutual overlap score between the two markables. In counting the scores corresponding to anaphors belonging to non-trivial chains (longer than one markable), the m value is no longer the number of common markables in the test and gold chain, but the sum of the mutual overlap scores (as above) between markables belonging to test and gold chains.

For the other cases, when there are differences between the two sets of markables (also identified by the identity of head criterion), in the computation of the correctness score, the non-intersecting markables in the test and gold files will not contribute to the sum, thus deteriorating the overall rating.

6 Conclusions

Our results show (as we expected) that the overall performance of the AR system is affected by the performance of the previous modules. In the case of a coreference resolution system, the main point which can influence the results is the detection of the markables. In our test-bed, we obtained a difference of 0.11 (in the case of the plain text input), and 0.09 (in the case of the annotated corpus) between the performance of the AR-engine when run with gold markables and its performance when run with test markables. Most of the blame must certainly be put on the RE-extractor, but not all of it: other pre-processing phases may also have 'helped' deteriorate the final score. For example, all models of AR rely heavily on the information extracted from the heads of the markables, which means that the script for finding the head of a markable also bears some responsibility.

An interesting thing to observe is the difference between the values obtained for the two types of corpus we worked with. In the case of the plain text we obtained smaller values when evaluating the RE-extractor than in the case of the manually annotated treebank. One explanation of this fact reinforces our belief: we have used an automatic pre-processing phase (the FDG-parser) in order to obtain the syntactic trees. The output of the FDG-parser is certainly not perfect, or at least it doesn't have the same quality as the manually annotated corpus. So this weak point in the extraction of the markables (in the case of the plain text corpus) had an impact on the performance of the RE-extractor.

The performance of the AR-engine is also better in the case of the Penn Treebank corpus. At least two reasons can be thought of: 1) the annotation is better; 2) the Penn Treebank consists of newspaper text, and the task of coreference resolution on newspapers seems to be easier than on the belles-lettres register.

The original work reported in this paper is the following:
- it proposes a methodology to evaluate pipe-line architectures when gold and test data are available in-between intermediate steps in the processing chain. The method allows the contribution of individual modules to the overall results to be assessed, irrespective of the degradation of the results due to the weakness of the preceding modules;
- it reports and compares new anaphora resolution results on input belonging to two different registers: belles-lettres and finance, and to two different levels of input: plain text and treebank annotation.

Acknowledgements

Part of this work has been accomplished under the IST-2000-29388 EU project Balkanet and with support from the Romanian Ministry of Education and Research under the CORINT programme. We are grateful to our colleagues from the Laboratory of Computational Linguistics of the University of Wolverhampton, who provided us with the FDG parsing of the Orwell text. We would also like to thank Dr. Nicolas Nicolov for advice, suggestions and comments on an earlier draft of the paper.

References

ACE, 2003. Entity Detection and Tracking – Phase 1, ACE Pilot study definition. Available from the ACE site: http://www.itl.nist.gov/iaui/894.01/tests/ace/phase1/index.htm
Alonso, L. and M. Fuentes, 2003. Integrating Cohesion and Coherence for Text Summarization. Proceedings of the EACL Student Session, Budapest, Hungary.
Collins, M., 1999. Head-Driven Statistical Models for Natural Language Parsing. PhD Thesis, University of Pennsylvania.
Cristea, D. and G. E. Dima, 2001. An Integrating Framework for Anaphora Resolution. Information Science and Technology, Bucharest, 4:3-4, 273-291.
Cristea, D., O. Postolache, G. E. Dima and C. Barbu, 2002. AR-Engine – a framework for unrestricted coreference resolution. Proceedings of LREC 2002, Las Palmas, vol. VI, 2000-2007.
Cristea, D. and O. Postolache, 2004. How to deal with wicked anaphora. In A. Branco, T. McEnery and R. Mitkov (eds.), Anaphora Processing: linguistic, cognitive and computational modeling. Amsterdam: John Benjamins (to appear).
Gaizauskas, R. and K. Humphreys, 2000. Quantitative Evaluation of Coreference Algorithms in an Information Extraction System. In S. Botley and T. McEnery (eds.), Corpus-based and Computational Approaches to Discourse Anaphora. Amsterdam/New York: John Benjamins.
Harabagiu, S. and S. Maiorano, 2003. Relational Meaning Used for Processing Complex Questions. Proceedings of the International Symposium on Reference Resolution and Its Applications to Question Answering and Summarization, Venice, Italy.
Hirschman, L. and N. Chinchor, 1997. MUC-7 Coreference Task Definition, version 3.0. MUC-7 Proceedings. See also: http://www.muc.saic.com
Luo, X., A. Ittycheriah, H. Jing, N. Kambhatla and S. Roukos, 2004. A Mention-Synchronous Coreference Resolution Algorithm Based on Bell Tree. Proceedings of ACL, Barcelona, Spain.
Järvinen, T. and P. Tapanainen, 1997. A dependency Parser for English. The Technical Reports of the Department of General Linguistics, University of Helsinki.
Marcus, M., G. Kim, M. A. Marcinkiewicz, R. MacIntyre, A. Bies, M. Ferguson, K. Katz and B. Schasberger, 1994. The Penn Treebank: Annotating predicate argument structure. Proceedings of Human Language Technology Workshop.
Mitkov, R., 2002. Anaphora resolution. London: Longman.
Orwell, G., 1949. 1984. London: Secker and Warburg.
Popescu-Belis, A., L. Rigouste, S. Salmon-Alt and L. Romary, 2004. Online Evaluation of Coreference Resolution. Proceedings of LREC'04, Lisbon, Portugal.
Postolache, O., 2004. RARE - Robust Anaphora Resolution Engine. Master Thesis, University of Iaşi.
Postolache, O. and C. Forascu, 2004. A Coreference Resolution Model On Excerpts from a Novel. Proceedings of ESSLLI Student Session, Nancy, France.
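
Illustrative sketch of the scoring in Sections 5.1 and 5.2

The per-anaphor correctness value and the resulting success rate described in Section 5.2, together with the mutual overlap score of Section 5.1, can be sketched in a few lines of Python. This is only a minimal illustration, assuming that the gold and test files contain identical sets of markables and that every markable belongs to exactly one chain in each partition. The function names (mutual_overlap, chain_of, correctness, success_rate) and the toy chains are hypothetical; they are not taken from the AR-engine or the test-bed.

```python
from collections import Counter
from typing import Dict, FrozenSet, Hashable, Iterable, Sequence


def mutual_overlap(tokens_a: Sequence[str], tokens_b: Sequence[str]) -> float:
    """Number of common words divided by the length of the longer markable.

    Assumes both markables are non-empty token lists.
    """
    common = sum((Counter(tokens_a) & Counter(tokens_b)).values())
    return common / max(len(tokens_a), len(tokens_b))


def chain_of(chains: Iterable[Iterable[Hashable]]) -> Dict[Hashable, FrozenSet[Hashable]]:
    """Map every markable to the (frozen) chain that contains it."""
    index: Dict[Hashable, FrozenSet[Hashable]] = {}
    for chain in chains:
        members = frozenset(chain)
        for markable in members:
            index[markable] = members
    return index


def correctness(anaphor, gold_index, test_index) -> float:
    """Correctness value in [0, 1] for one anaphor (identical markable sets)."""
    gold_others = gold_index[anaphor] - {anaphor}
    test_others = test_index[anaphor] - {anaphor}
    if not gold_others:
        # Trivial gold chain: correct only if the test chain is trivial too.
        return 1.0 if not test_others else 0.0
    # m of the n gold chain-mates are found in the anaphor's test chain.
    return len(gold_others & test_others) / len(gold_others)


def success_rate(gold_chains, test_chains) -> float:
    """Average correctness value over all anaphors."""
    gold_index = chain_of(gold_chains)
    test_index = chain_of(test_chains)
    values = [correctness(a, gold_index, test_index) for a in gold_index]
    return sum(values) / len(values)


# Toy example (hypothetical data): five markables a..e.
gold = [{"a", "b", "c"}, {"d"}, {"e"}]
test = [{"a", "b"}, {"c", "d"}, {"e"}]
print(success_rate(gold, test))  # a and b score 0.5, c and d score 0, e scores 1
print(mutual_overlap("the junk bond market".split(), "junk bond market".split()))
```

In the toy example the values 0.5, 0.5, 0, 0 and 1 average to 0.4. Extending the sketch to markables with different spans would amount to replacing the counts in correctness with sums of mutual_overlap scores, as described in Section 5.2.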